Feature Engineering¶
The main objectives for this notebook are:
- Develop a set of features that have the potential to improve your model's performance
- Investigate the relationships between your new features and your target
The skills that you need to showcase:
- Your domain expertise
- Your data wrangling skills
How to stand out?¶
- Engineer a well-argued feature (if you have sources to back it up, that's a 2x bonus)
- Validate your features after engineering
- Don't use blind (automated) feature engineering - it's a waste of time
- Design a feature engineering pipeline at the end of the notebook
Imports¶
import os
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.io as pio
import seaborn as sns
from feature_engine.selection import SmartCorrelatedSelection
import polars as pl
# Path needs to be added manually to read from another folder
path2add = os.path.normpath(
os.path.abspath(os.path.join(os.path.dirname("__file__"), os.path.pardir, "utils"))
)
if not (path2add in sys.path):
sys.path.append(path2add)
from feature_engineering import (
aggregate_node_features,
feature_predictive_power,
get_graph_features,
)
pio.renderers.default = "notebook"
data = pl.read_parquet('../data/supervised_clean_data.parquet')
calls = pl.read_json('../data/supervised_call_graphs.json')
data.head(1)
| | _id | inter_api_access_duration(sec) | api_access_uniqueness | sequence_length(count) | vsession_duration(min) | ip_type | num_sessions | num_users | num_unique_apis | source | classification | is_anomaly |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | f64 | f64 | f64 | i64 | str | f64 | f64 | f64 | str | str | bool |
| 0 | "1f2c32d8-2d6e-… | 0.000812 | 0.004066 | 85.643243 | 5405 | "default" | 1460.0 | 1295.0 | 451.0 | "E" | "normal" | false |
calls.head(1)
| _id | call_graph |
|---|---|
| str | list[struct[2]] |
| "1f2c32d8-2d6e-… | [{"1f873432-6944-3df9-8300-8a3cf9f95b35","5862055b-35a6-316a-8e20-3ae20c1763c2"}, {"8955faa9-0e33-37ad-a1dc-f0e640a114c2","a4fd6415-1fd4-303e-aa33-bb1830b5d9d4"}, … {"016099ea-6f20-3fec-94cf-f7afa239f398","6fa8ad53-2f0d-3f44-8863-139092bfeda9"}] |
Since the main dataset already contains engineered features, there's not much opportunity for further feature engineering there. Instead, additional features will be created from the graph data in supervised_call_graphs.json
Process Graph Data¶
calls_processed = (
calls.with_columns(
pl.col("call_graph").list.eval(
pl.element().struct.rename_fields(["from", "to"])
)
)
.explode("call_graph")
.unnest("call_graph")
)
calls_processed.head()
| _id | from | to |
|---|---|---|
| str | str | str |
| "1f2c32d8-2d6e-… | "1f873432-6944-… | "5862055b-35a6-… |
| "1f2c32d8-2d6e-… | "8955faa9-0e33-… | "a4fd6415-1fd4-… |
| "1f2c32d8-2d6e-… | "85754db8-6a55-… | "85754db8-6a55-… |
| "1f2c32d8-2d6e-… | "9f08fee1-953c-… | "876b4958-7df1-… |
| "1f2c32d8-2d6e-… | "857c4b20-3057-… | "857c4b20-3057-… |
Feature Engineering¶
We can see that each graph has a separate _id that can later be used to join to the main dataset. A graph consists of source and destination nodes which refer to the available API calls.
Basic Graph Level Features¶
The most basic graph-level features that we can engineer are:
- Number of edges (connections)
- Number of nodes (APIs)
These features can be useful since most behaviours are going to have a "normal" range of APIs that they contact. If this number is too large or too small, this might be an indication of anomalous activity.
graph_features = calls_processed.group_by('_id').agg(
pl.len().alias('n_connections'),
pl.col('from'),
pl.col('to')
).with_columns(
pl.concat_list('from', 'to').list.unique().list.len().alias('n_unique_nodes')
).select([
'_id',
'n_connections',
'n_unique_nodes'
])
graph_features.sample(3)
| _id | n_connections | n_unique_nodes |
|---|---|---|
| str | u32 | u32 |
| "79c18974-2983-… | 68 | 31 |
| "ab6f299d-be1c-… | 12 | 10 |
| "5e8cc48d-d2bc-… | 17 | 10 |
Node Level Features¶
Since graphs consist of nodes, we can engineer a set of features around specific nodes (APIs). We can calculate:
- Node degrees - the number of edges that come from/into a node. Very highly connected nodes can look anomalous.
- Node centrality - there are various centrality measures (e.g. PageRank), but they all try to estimate how important a specific node is to the whole graph. This feature could be useful because a behaviour pattern that doesn't touch any of the "central" APIs would look anomalous
These features can be broken down into:
- global features - measure node attributes across all the graphs
- local features - measure node attributes across a specific graph
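Centrality measures such as PageRank are discussed above but not computed in this notebook. As an illustrative sketch only (assuming networkx is available; graph_pagerank_stats is a hypothetical helper, not part of utils), per-graph PageRank statistics could be derived like this:

```python
# Hypothetical sketch: summarise PageRank centrality for a single call graph.
# Assumes networkx is installed; this helper is not part of the utils module.
import networkx as nx


def graph_pagerank_stats(edges: list[tuple[str, str]]) -> dict[str, float]:
    """Average and max PageRank over the nodes of one call graph."""
    g = nx.DiGraph()
    g.add_edges_from(edges)
    ranks = nx.pagerank(g)
    values = list(ranks.values())
    return {
        "avg_pagerank": sum(values) / len(values),
        "max_pagerank": max(values),
    }


# Tiny example graph: "c" receives the most edges, so it ranks highest
stats = graph_pagerank_stats([("a", "b"), ("b", "c"), ("a", "c")])
```

Aggregated the same way as the degree features below, such statistics could be joined onto the graph-level feature set by _id.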
calls_processed = calls_processed.with_columns(
global_source_degrees = pl.len().over(pl.col('from')),
global_dest_degrees = pl.len().over(pl.col('to')),
local_source_degrees = pl.len().over(pl.col('from'), pl.col('_id')),
local_dest_degrees = pl.len().over(pl.col('to'), pl.col('_id'))
)
calls_processed.sample(3)
| _id | from | to | global_source_degrees | global_dest_degrees | local_source_degrees | local_dest_degrees |
|---|---|---|---|---|---|---|
| str | str | str | u32 | u32 | u32 | u32 |
| "290fe43b-8b93-… | "756ab2fe-a386-… | "27c07c16-5720-… | 6808 | 12223 | 2 | 3 |
| "ea0d02f5-ef61-… | "90a655af-9f52-… | "a449d369-17b1-… | 998 | 22013 | 1 | 18 |
| "f06fbc92-2a0e-… | "43dcab78-0f41-… | "1d768e1f-ee4c-… | 2885 | 1035 | 37 | 16 |
Now that the node-level features are calculated, we need to aggregate them for a specific graph (_id). When aggregating, we can calculate average, std, min, and max statistics for every feature to capture the distribution well.
node_features_agg = aggregate_node_features(
calls_processed,
node_features=[
"global_source_degrees",
"global_dest_degrees",
"local_source_degrees",
"local_dest_degrees",
],
by="_id",
)
graph_features = graph_features.join(node_features_agg, on="_id")
graph_features.head()
| _id | n_connections | n_unique_nodes | avg_global_source_degrees | min_global_source_degrees | max_global_source_degrees | std_global_source_degrees | avg_global_dest_degrees | min_global_dest_degrees | max_global_dest_degrees | std_global_dest_degrees | avg_local_source_degrees | min_local_source_degrees | max_local_source_degrees | std_local_source_degrees | avg_local_dest_degrees | min_local_dest_degrees | max_local_dest_degrees | std_local_dest_degrees |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | u32 | u32 | f64 | u32 | u32 | f64 | f64 | u32 | u32 | f64 | f64 | u32 | u32 | f64 | f64 | u32 | u32 | f64 |
| "8c603b36-76b3-… | 60 | 27 | 5843.283333 | 63 | 32071 | 5330.261046 | 7802.75 | 273 | 22416 | 6417.05538 | 5.666667 | 1 | 15 | 5.586505 | 4.1 | 1 | 11 | 3.462633 |
| "538884b8-1a36-… | 11 | 10 | 4150.454545 | 55 | 32071 | 9742.337629 | 4079.272727 | 48 | 22013 | 8500.830066 | 1.363636 | 1 | 2 | 0.504525 | 1.545455 | 1 | 3 | 0.934199 |
| "460ab8c1-b9ec-… | 627 | 179 | 4908.614035 | 8 | 32071 | 8230.132013 | 5324.5311 | 8 | 22416 | 7059.300173 | 11.060606 | 1 | 48 | 12.578425 | 11.45933 | 1 | 38 | 11.202301 |
| "e6ab8dbb-bad2-… | 112 | 45 | 8288.330357 | 79 | 32071 | 9626.036163 | 10587.276786 | 65 | 22416 | 9107.784302 | 5.125 | 1 | 14 | 4.200225 | 7.660714 | 1 | 19 | 6.795008 |
| "ea024be4-a2cb-… | 62 | 33 | 241.064516 | 1 | 596 | 201.201951 | 407.919355 | 1 | 1151 | 430.800068 | 3.354839 | 1 | 8 | 2.389161 | 4.548387 | 1 | 11 | 3.67837 |
Feature Selection¶
Feature selection will be done in two steps:
- Quality checks - if the feature is constant or has too many missing values (>= 95%) it will be dropped
- Correlation analysis - if features have very high correlation (>= 95%) with each other, they can be dropped as well
engineered_features = graph_features.columns[1:]
engineered_features
['n_connections', 'n_unique_nodes', 'avg_global_source_degrees', 'min_global_source_degrees', 'max_global_source_degrees', 'std_global_source_degrees', 'avg_global_dest_degrees', 'min_global_dest_degrees', 'max_global_dest_degrees', 'std_global_dest_degrees', 'avg_local_source_degrees', 'min_local_source_degrees', 'max_local_source_degrees', 'std_local_source_degrees', 'avg_local_dest_degrees', 'min_local_dest_degrees', 'max_local_dest_degrees', 'std_local_dest_degrees']
Quality Checks¶
null_counts = graph_features.null_count().transpose(include_header=True, header_name='col', column_names=['null_count'])
null_counts.filter(pl.col('null_count') > 0)
| col | null_count |
|---|---|
| str | u32 |
| "std_global_sou… | 42 |
| "std_global_des… | 42 |
| "std_local_sour… | 42 |
| "std_local_dest… | 42 |
static_features = graph_features.select(engineered_features).std().transpose(include_header=True, header_name='col', column_names=['std'])
static_features.filter(pl.col('std') == 0)
| col | std |
|---|---|
| str | f64 |
Observations:
- The four std features have 42 missing values each (likely from graphs with a single edge, where std is undefined) - far below the 95% drop threshold
- No features are constant
Impact
- No features will be dropped for quality reasons
Correlation Analysis¶
Next, let's inspect the correlation structure of the engineered features.
feature_corrs = graph_features.select(engineered_features).corr().to_pandas()
feature_corrs.index = feature_corrs.columns
matrix = np.triu(feature_corrs)
fig = plt.figure(figsize=(20, 10))
sns.heatmap(feature_corrs, annot=True, mask=matrix)
We can see clear groups of highly correlated features. Hence, let's apply SmartCorrelatedSelection to reduce the set of engineered features
features_pd = graph_features.select(engineered_features).to_pandas().dropna()
tr = SmartCorrelatedSelection(
variables=None,
method="pearson",
threshold=0.95,
missing_values="raise",
selection_method="variance",
estimator=None,
)
tr.fit(features_pd)
print('Features to drop:')
for f in tr.features_to_drop_:
print(f)
Features to drop:
n_unique_nodes
std_global_dest_degrees
avg_local_source_degrees
max_local_source_degrees
avg_local_dest_degrees
max_local_dest_degrees
std_local_dest_degrees
Observations:
- Engineered features have groups of high correlation
Impact
- ['n_unique_nodes', 'std_global_dest_degrees', 'avg_local_source_degrees', 'max_local_source_degrees', 'avg_local_dest_degrees', 'max_local_dest_degrees', 'std_local_dest_degrees'] are dropped from the features list because each belongs to a highly correlated group and has lower variance than the feature that is kept
EDA for Remaining Engineered Features¶
remaining_engineered_features = list(set(features_pd.columns).difference(set(tr.features_to_drop_)))
graph_features = graph_features.join(data.select(['_id', 'is_anomaly']), on='_id')
scores = []
for f in remaining_engineered_features:
print("Feature Analysis:", f)
score = feature_predictive_power(graph_features, f, "is_anomaly")
scores.append(score)
Feature Analysis: std_global_source_degrees Predictive Power Score: 0.44369998574256897
Feature Analysis: min_global_source_degrees Predictive Power Score: 0.5494999885559082
Feature Analysis: max_global_dest_degrees Predictive Power Score: 0.5921000242233276
Feature Analysis: avg_global_source_degrees Predictive Power Score: 0.328900009393692
Feature Analysis: min_local_dest_degrees Predictive Power Score: 0.007799999788403511
Feature Analysis: max_global_source_degrees Predictive Power Score: 0.36739999055862427
Feature Analysis: avg_global_dest_degrees Predictive Power Score: 0.3370000123977661
Feature Analysis: min_global_dest_degrees Predictive Power Score: 0.5932000279426575
Feature Analysis: n_connections Predictive Power Score: 0.5871999859809875
Feature Analysis: std_local_source_degrees Predictive Power Score: 0.5327000021934509
Feature Analysis: min_local_source_degrees Predictive Power Score: 0.0
pd.Series(scores, index=remaining_engineered_features).sort_values(ascending=False)
min_global_dest_degrees      0.5932
max_global_dest_degrees      0.5921
n_connections                0.5872
min_global_source_degrees    0.5495
std_local_source_degrees     0.5327
std_global_source_degrees    0.4437
max_global_source_degrees    0.3674
avg_global_dest_degrees      0.3370
avg_global_source_degrees    0.3289
min_local_dest_degrees       0.0078
min_local_source_degrees     0.0000
dtype: float32
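The feature_predictive_power helper comes from utils and its implementation isn't shown. One common way to compute such a score (the idea behind the ppscore package; the actual helper may well differ) is the cross-validated performance of a shallow, single-feature decision tree:

```python
# Hypothetical sketch of a single-feature predictive power score.
# This mirrors the ppscore idea; the real utils helper may differ.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def single_feature_score(x: np.ndarray, y: np.ndarray) -> float:
    """Mean cross-validated F1 of a shallow tree fit on one feature."""
    model = DecisionTreeClassifier(max_depth=3, random_state=42)
    return cross_val_score(model, x.reshape(-1, 1), y, cv=4, scoring="f1").mean()


# Synthetic example: the target is perfectly separable by this feature,
# so the score should be close to 1
rng = np.random.default_rng(0)
x = rng.normal(size=400)
y = (x > 0.5).astype(int)
score = single_feature_score(x, y)
```

Because a tree learns thresholds rather than linear trends, a score like this can also pick up non-linear relationships between a feature and the target.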
Observations:
- Most of the engineered features have relatively high predictive power scores
- The most predictive features are global
- Features with no predictive power measure minimum degrees of local graphs
- Relationships between engineered features and the target are non-linear
Impact
- min_local_dest_degrees and min_local_source_degrees can be dropped
- Tree-based models need to be used to capture the engineered relationships
remaining_engineered_features = [f for f in remaining_engineered_features if f not in ['min_local_dest_degrees', 'min_local_source_degrees']]
print('Final engineered featureset:')
print(remaining_engineered_features)
Final engineered featureset: ['std_global_source_degrees', 'min_global_source_degrees', 'max_global_dest_degrees', 'avg_global_source_degrees', 'max_global_source_degrees', 'avg_global_dest_degrees', 'min_global_dest_degrees', 'n_connections', 'std_local_source_degrees']
Feature Engineering Pipeline¶
selected_features = [
"max_global_source_degrees",
"avg_global_source_degrees",
"min_global_dest_degrees",
"std_local_source_degrees",
"max_global_dest_degrees",
"min_global_source_degrees",
"std_global_source_degrees",
"n_connections",
"avg_global_dest_degrees",
]
calls = (
(
pl.read_json("../data/supervised_call_graphs.json")
.with_columns(
pl.col("call_graph").list.eval(
pl.element().struct.rename_fields(["from", "to"])
)
)
.explode("call_graph")
.unnest("call_graph")
)
.with_columns(
global_source_degrees=pl.len().over(pl.col("from")),
global_dest_degrees=pl.len().over(pl.col("to")),
local_source_degrees=pl.len().over(pl.col("from"), pl.col("_id")),
local_dest_degrees=pl.len().over(pl.col("to"), pl.col("_id")),
)
.pipe(get_graph_features)
.select(["_id"] + selected_features)
)
pl.read_parquet("../data/supervised_clean_data.parquet").join(
calls, on="_id"
).write_parquet("../data/supervised_clean_data_w_features.parquet")
Summary¶
Feature Engineering Summary¶
- 18 new features were engineered, measuring graph-level and node-level attributes
- Graph-level features measure the total size of the graphs
- Node level features measure the degrees on global and local levels
- 7 features were dropped due to high correlation within group
- 2 more features were dropped due to low predictive power score
Implications for ML¶
- The 9 engineered and selected features are theorised to be useful in the prediction task, so they should be included in the final model
- A feature engineering pipeline was designed, so new data can be easily transformed